-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix v18 tablets removed from healthcheck when topo server GetTablet call fails #201
Fix v18 tablets removed from healthcheck when topo server GetTablet call fails #201
Conversation
…all fails Signed-off-by: Brendan Dougherty <[email protected]>
@@ -190,6 +190,17 @@ func (tw *TopologyWatcher) loadTablets() { | |||
topologyWatcherOperations.Add(topologyWatcherOpGetTablet, 1) | |||
<-tw.sem // Done; enable next request to run | |||
if err != nil { | |||
if !topo.IsErrType(err, topo.NoNode) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would the error be a NoNode
even when it's a network partition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the NoNode
error means the path does not exist in zookeeper. We would get a NoNode
error if the tablet was removed from the topo server between the call to get all the tablet aliases (tw.getTablets(tw)
) and the call to get that tablet (tw.topoServer.GetTablet(tw.ctx, alias)
). So we allow the tablet to be removed from the healthcheck on a NoNode
error.
…thcheck-on-topo-error Fix v18 tablets removed from healthcheck when topo server GetTablet call fails (cherry picked from commit a46ede6)
…thcheck-on-topo-error Fix v18 tablets removed from healthcheck when topo server GetTablet call fails (cherry picked from commit a46ede6)
…thcheck-on-topo-error Fix v18 tablets removed from healthcheck when topo server GetTablet call fails (cherry picked from commit a46ede6)
…thcheck-on-topo-error Fix v18 tablets removed from healthcheck when topo server GetTablet call fails (cherry picked from commit a46ede6)
…thcheck-on-topo-error Fix v18 tablets removed from healthcheck when topo server GetTablet call fails (cherry picked from commit a46ede6)
…thcheck-on-topo-error Fix v18 tablets removed from healthcheck when topo server GetTablet call fails (cherry picked from commit a46ede6)
…thcheck-on-topo-error Fix v18 tablets removed from healthcheck when topo server GetTablet call fails (cherry picked from commit a46ede6)
…thcheck-on-topo-error Fix v18 tablets removed from healthcheck when topo server GetTablet call fails (cherry picked from commit a46ede6)
…thcheck-on-topo-error Fix v18 tablets removed from healthcheck when topo server GetTablet call fails (cherry picked from commit a46ede6)
…thcheck-on-topo-error Fix v18 tablets removed from healthcheck when topo server GetTablet call fails (cherry picked from commit a46ede6)
…thcheck-on-topo-error Fix v18 tablets removed from healthcheck when topo server GetTablet call fails (cherry picked from commit a46ede6)
…thcheck-on-topo-error Fix v18 tablets removed from healthcheck when topo server GetTablet call fails (cherry picked from commit a46ede6)
…thcheck-on-topo-error Fix v18 tablets removed from healthcheck when topo server GetTablet call fails (cherry picked from commit a46ede6)
…thcheck-on-topo-error Fix v18 tablets removed from healthcheck when topo server GetTablet call fails (cherry picked from commit a46ede6)
…thcheck-on-topo-error Fix v18 tablets removed from healthcheck when topo server GetTablet call fails (cherry picked from commit a46ede6)
This is the v18 fix for this bug: vitessio#15632
The code for getting tablets for healthchecks was significantly altered between v18 and v19, so even though the bug is similar, the fix is different.
Description
In v18,
TopologyWatcher.loadTablets()
roughly works like this:newTablets
mapnewTablets
newTablets
:tw.tablets
(the current set of tracked tablets)tw.tablets
(the current set of tracked tablets)newTablets
So the bug in v18 is that if a GetTablet call fails, the tablet is not added to
newTablets
, and thus removed from the healthcheck when iteratingtw.tablets
.Changes
TopologyWatcher.loadTablets()
to handle GetTablet errors by backfillingnewTablets
with the current value fromtw.tablets
. This is the same approach used in the v19+ fix.TestGetTabletErrorDoesNotRemoveFromHealthcheck
test from the v19+ fixTestGetTabletNoNodeErrorRemovesFromHealthcheck
to cover the case where a tablet is removed from the topo server bewteen the call togetTablets
and the call toGetTablet
for that alias. That scenario is handled differently in v19+ so I added explicit handling for it in this fix.